Module 1

Predictive Analytics Problem Definition

Different Types of Analytics

Descriptive Analytics

Focuses on insights from the past and answers the question “What happened?”

Predictive Analytics

Focuses on the future and addresses “What might happen next?”

Prescriptive Analytics

Suggests decision options, answering questions such as “What is the best course of action?” or “What will happen if I do this?”

Common Characteristics of Predictive Modeling Problems

  • A business issue can be identified and defined.
  • The issue can be addressed with a few well-defined questions.
  • Lots of good and useful data can be used to answer these questions.
  • The predictions will drive actions or increase understanding.
  • It is expected to outperform any existing approach.
  • The model can be continuously monitored and updated as new data is made available.

Model Complexity, Bias, and Variance

Bias

The expected loss arising from the model not being complex/flexible enough to capture the underlying signal.

High bias means that our model won’t be accurate because it doesn’t have the capacity to capture the signal in the data.

Variance

The expected loss arising from the model being too complex and overfitting to a specific instance of the data.

High variance means that our model won’t be accurate because it overfit to the data it was trained on and, thus, won’t generalize well to new, unseen data. To understand and measure variance, we will need training and testing samples.
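The tradeoff between these two can be summarized by the standard bias-variance decomposition of expected squared-error loss at a point \(x\):

\(E[(y - \hat{f}(x))^2] = (\text{Bias}[\hat{f}(x)])^2 + \text{Var}[\hat{f}(x)] + \sigma^2\)

where \(\sigma^2\) is the irreducible noise in the data. Increasing model complexity typically lowers bias but raises variance; the goal is to minimize their sum.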

Problem Definition

Recipe to Effective Problem Definition

Clarity

Clarity of the business issue. The key to achieving clarity is asking the right questions.

Hypothesis

A testable hypothesis to guide the project. This also gives management a clearer picture of what can be expected from the project.

KPI

An ability to assess the outcome based on clear and measurable KPIs.

Evaluate and Prioritize

Module 2

Granularity

The more granular a variable, the fewer observations each level will have. For example, year will have fewer levels but more observations per level than month or day.

Data Exploration

We will use SOA Mortality data for this section. All illustrations are based on this dataset.

Univariate Data Exploration

The techniques used to explore the distribution of variables depend on the type of variable (numeric vs categorical). The two types of techniques are summary statistics and data visualization.

Summary Statistics

Used to find the mean, median, percentiles, and spread (variance, standard deviation, IQR) of a variable.

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   33.00   43.00   42.97   54.00   94.00 
 Variance    StdDev       IQR 
230.93738  15.19662  21.00000 

Skewness can be inferred from these statistics: if the mean > median, the distribution is right-skewed; if the mean < median, it is left-skewed. The example above is close to symmetric (mean ≈ median).
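The same summary statistics and skew check can be sketched in code. This is a Python sketch on a small hypothetical sample (the output above came from R on the SOA mortality data):

```python
import statistics

# Hypothetical ages, loosely mimicking a near-symmetric distribution
ages = [20, 28, 33, 38, 41, 43, 45, 50, 54, 60, 68]

mean = statistics.mean(ages)
median = statistics.median(ages)
q1, _, q3 = statistics.quantiles(ages, n=4)  # quartiles (exclusive method)
iqr = q3 - q1

# mean > median suggests right skew; mean < median suggests left skew
if mean > median:
    shape = "right-skewed"
elif mean < median:
    shape = "left-skewed"
else:
    shape = "symmetric"
print(round(mean, 2), median, iqr, shape)
```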

Histograms

Used to show the shape of the distribution of a numeric variable.

Box Plots

Used to determine if outliers exist in a numeric variable.

Note: Both histograms and box plots should show a visual representation of the summary statistics.

Frequency Tables and Bar charts

Both are used to get a sense of the distribution of a categorical variable. Since categorical variables are unordered, summary statistics won’t work here.

Frequency tables work well only when the variable has few levels. With many levels, bar charts make the distribution easier to visualize.

Exposure Summary

prodcat  exposure_cnt  exposure_cnt_p   exposure_face  exposure_face_p    count  count_p
TRM            19,353           4.47%   3,911,705,808            9.32%   26,788    5.36%
UL             87,728          20.24%  15,656,468,107           37.31%  111,932   22.39%
ULSG           32,918           7.60%  10,476,583,759           24.96%   49,688    9.94%
WL            293,374          67.70%  11,922,500,099           28.41%  311,592   62.32%

Bivariate Data Exploration

There are three combinations of variables to explore: Numeric vs Numeric, Categorical vs Categorical, and Numeric vs Categorical. There are many visualization tools that can be used for bivariate exploration:

Split Boxplot

Used to look at the distribution of a numeric variable split by a categorical variable.

Stacked/Split Histogram

This is also used to look at the distribution of a numeric variable split by a categorical variable. It is only suitable when the categorical variable has a low number of levels; otherwise a split box plot should be used.

Stacked Bar Chart

Used to look at the distribution of a categorical variable split by another categorical variable. When there are too many levels, the chart becomes difficult to interpret.

Scatter Plots

Used to see the relationship between two numeric variables. See the examples below:

The below shows scatterplots based on subsets of the data.

Module 3

Feature Generation and Selection

Variable vs Feature

A variable is one of the many columns in the original dataset. A feature is either an original variable selected to be used in a model or a feature generated from a transformed variable to be used in a model.

A variable could be stock price, while a feature generated from that variable could be change in stock price over some time period or average stock price over some period.

Non-Linear Relationships

A term-life data set is used here.

Identifying Non-Linear Relationships

The scatter plot below shows no clear linear relationships among the continuous variables of this dataset.

Transformation to Adjust for Nonlinear Relationships

A log transformation can be applied to skewed (often financial) data to address the non-linear issue.

Below is a scatter plot with two of the variables log-transformed. After the transformation, the points are more spread out, so patterns, if any, can be identified.

Note: When modeling with transformed variables (as in this case), it is important to remember to transform the resulting predictions back to the original, untransformed scale.
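The transform/back-transform round trip can be sketched as follows (Python, with hypothetical face amounts and a hypothetical log-scale prediction):

```python
import math

# Hypothetical, heavily skewed face amounts
face = [10_000, 25_000, 50_000, 100_000, 1_000_000]

# Log-transform before modeling to compress the long right tail
log_face = [math.log(x) for x in face]

# Suppose a model fitted on the log scale predicts this value (hypothetical)
log_pred = 11.0

# Back-transform the prediction to the original dollar scale
pred = math.exp(log_pred)
print(round(pred, 2))
```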

Principal Components Analysis

Definition

PCA is a dimensionality reduction technique that transforms a large set of correlated variables into a smaller set of uncorrelated features called principal components, while retaining as much of the original variation in the data as possible. It can only be applied on numeric data. Categorical variables have to be converted beforehand.

Principal components are linear combinations of the original variables.

Why is it necessary?

  • To reduce the number of variables relative to observations in the data, which helps with the curse of dimensionality.
  • Reduce or eliminate multicollinearity. Highly correlated predictors cause unstable coefficient estimates and inflated standard errors.

Downsides

  • Loss of interpretability since principal components are weighted combinations of the original variables.
  • PCA only captures linear relationships in the data.
  • Variables with larger ranges dominate the principal components. If you have “age” (0-100) and “claim count” (0-50) and “premium” (0-100,000), premium will drive the components unless you standardize first.
  • Extreme values can distort the principal components since PCA is based on variance/covariance matrices. A few large claims could skew your entire transformation.

Analyzing PCA output

We will look at data based on diamonds. As mentioned PCA can only be done on numerical variables. Let’s look at the numerical variables in the diamonds data.

Diamonds Data
carat depth table price x y z
0.23 61.5 55 326 3.95 3.98 2.43
0.21 59.8 61 326 3.89 3.84 2.31
0.23 56.9 65 327 4.05 4.07 2.31
0.29 62.4 58 334 4.20 4.23 2.63
0.31 63.3 58 335 4.34 4.35 2.75

We will use carat, depth, x, y and z to conduct PCA.

The two main outputs from PCA are the summary and the loadings.

In the summary, focus mainly on the proportion of variance explained by each PC, and secondarily on the standard deviation of each PC.

Importance of components:
                          PC1    PC2     PC3     PC4     PC5
Standard deviation     1.9890 1.0051 0.17862 0.03540 0.02591
Proportion of Variance 0.7912 0.2020 0.00638 0.00025 0.00013
Cumulative Proportion  0.7912 0.9932 0.99962 0.99987 1.00000

The loadings (rotations) show how each variable contributes to the PCs. The sum of squared loadings for each PC equals 1.

              PC1          PC2         PC3          PC4          PC5
carat -0.49668722  0.005903109 -0.86783797  0.009262406 -0.006199172
depth -0.01409718  0.994550858  0.01548542 -0.007080449 -0.101881938
x     -0.50129255 -0.050645176  0.28094608 -0.747021349 -0.330407707
y     -0.50111275 -0.055294983  0.29683507  0.660069453 -0.471196063
z     -0.50069438  0.072189161  0.28209166  0.078303871  0.811410290
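The mechanics behind output like this can be sketched via an eigendecomposition. Below is a Python/numpy sketch on the five diamond rows shown above (carat, depth, x, y, z); the output above came from R's prcomp, whose exact options and signs may differ:

```python
import numpy as np

# The five diamonds rows from the table above: carat, depth, x, y, z
X = np.array([
    [0.23, 61.5, 3.95, 3.98, 2.43],
    [0.21, 59.8, 3.89, 3.84, 2.31],
    [0.23, 56.9, 4.05, 4.07, 2.31],
    [0.29, 62.4, 4.20, 4.23, 2.63],
    [0.31, 63.3, 4.34, 4.35, 2.75],
])

# Standardize first so no variable dominates through its scale
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# PCA via eigendecomposition of the covariance of the standardized data
cov = np.cov(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)        # ascending eigenvalues
order = np.argsort(eigvals)[::-1]             # sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

prop_var = eigvals / eigvals.sum()            # proportion of variance per PC
loadings = eigvecs                            # each column: one PC's loadings

print(np.round(prop_var, 4))
# Sum of squared loadings of each PC equals 1
print(np.round((loadings ** 2).sum(axis=0), 4))
```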

Biplots

PCA can be visualized using biplots.

K-Means Clustering

Definition

K-Means is an unsupervised learning algorithm that partitions data into K distinct, non-overlapping clusters. K is specified upfront (K = 5 or K = 10 are common starting defaults). Data needs standardization prior to clustering, and the algorithm is designed for continuous variables.

Why it’s used?

  • Find natural groupings in data that we may not have known existed.
  • Replace individual observations with their cluster membership (essentially classification), reducing complexity.
  • Observations that don’t fit well into any cluster may be outliers or anomalies.

Advantages

  • Has high interpretability. Very easy to explain to non-technical stakeholders.
  • Guaranteed convergence; the algorithm will always find a solution.
  • Clusters can become categorical variables.

Disadvantages

  • Must specify K in advance which may require trial and error.
  • Sensitive to initialization due to the random placement of initial centroids. Best practice is to run the algorithm multiple times with different seeds.
  • Assumes spherical clusters and struggles with elongated clusters or nested clusters.
  • Sensitive to outliers.
  • Needs standardization.

Algorithm

1. Initialize

Randomly place K centroids in the data space. We can select K by using the elbow method (creating an elbow plot).

2. Assignment

Assign each observation to the nearest centroid based on Euclidean distance.

3. Update

Recalculate each centroid as the mean of all observations assigned to that cluster.

4. Repeat

Keep alternating between steps 2 and 3 until the centroids stabilize.

The algorithm minimizes the within-cluster sum of squares, making each cluster as tight and homogeneous as possible.
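The four steps above can be sketched in miniature. This is a minimal pure-Python sketch on hypothetical, already-scaled 2-D points (a real analysis would use a library implementation with multiple restarts):

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal K-Means: initialize, assign, update, repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # 1. random initialization
    for _ in range(iters):
        # 2. assignment: each point to its nearest centroid (Euclidean)
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        # 3. update: each centroid becomes the mean of its assigned points
        new = [
            tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
        if new == centroids:                   # 4. stop once centroids stabilize
            break
        centroids = new
    return centroids, clusters

# Two obvious groups (hypothetical, on comparable scales)
pts = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, clusters = kmeans(pts, k=2)
print(sorted(len(c) for c in clusters))
```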

Hierarchical Clustering

Definition

An unsupervised learning method that builds a hierarchy of clusters in a tree-like structure (a dendrogram) showing how observations are grouped together at different levels of similarity.

Why is it used?

  • Exploratory Analysis: When you don’t know how many groupings may exist in your data, the dendrogram shows all possibilities

  • Nested Segmentation.

Advantages

  • No need to specify K.
  • The tree structure is easy to interpret and explain.
  • No random initialization like K-Means.
  • Captures nested structures

Disadvantages

  • Computational complexity.
  • Sensitive to outliers.
  • Requires cutting the dendrogram.

Algorithm

1. Start

There are two methods:

  • Agglomerative: Treat each observation as its own cluster. (N observations = N clusters)
  • Divisive : Start with all observations in a single cluster.

2. Merge/Split

  • Agglomerative: Merge the two closest clusters.
  • Divisive : Split into smaller clusters.

The distance here is based on the linkage method used to measure distance between clusters, which is discussed below.

3. Update

Recalculate distance between the new cluster and all other clusters.

4. Repeat

  • Agglomerative: Keep merging the closest clusters until only one cluster remains.
  • Divisive : Keep splitting clusters until each observation is its own cluster.

5. Cut

Decide where to cut the dendrogram to get the final k clusters.

Linkage Methods

The method of calculating distance between clusters.

Single Linkage

Distance between closest points in two clusters.

Complete Linkage

Distance between farthest points.

Average Linkage

Average distance between all pairs of points.

Ward’s Linkage

Minimizes within-cluster variance. Similar in spirit to K-Means.
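The agglomerative procedure with single or complete linkage can be sketched in a few lines. This is a pure-Python toy on hypothetical 2-D points (real work would use a library routine that also records merge heights for the dendrogram):

```python
import math

def agglomerative(points, k, linkage="single"):
    """Start with every point as its own cluster; repeatedly merge
    the two closest clusters until only k remain (the 'cut')."""
    clusters = [[p] for p in points]           # N observations = N clusters

    def dist(a, b):
        pairs = [math.dist(p, q) for p in a for q in b]
        # single linkage: closest points; complete linkage: farthest points
        return min(pairs) if linkage == "single" else max(pairs)

    while len(clusters) > k:
        # find the two closest clusters under the chosen linkage
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] += clusters.pop(j)         # merge; distances recomputed next pass
    return clusters

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (0.2, 0.1)]
out = agglomerative(pts, k=2)
print(sorted(len(c) for c in out))
```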

Module 4

GLM Assumptions

  • The observations of the target variable are independent.
  • The target variable’s distribution is of the exponential family.
  • The expected value of the target variable is \(\mu = g^{-1}(\eta)\), where \(\eta = X\beta\), \(g\) is the link function, and \(g^{-1}\) is its inverse.

Commonly used Distributions

Binomial (Logistic Regression)

  • Used for binary data; logistic regression pairs the binomial distribution with a logit link.

Poisson

  • Used for count data.
  • Assumes the mean and variance of the target variable are equal (equidispersion).
    • Overdispersion occurs when the variance of the target exceeds the mean; use the negative binomial distribution instead.

Gamma/Inverse Gaussian

  • Used for continuous positive-value data.

Gamma

  • Highly flexible; handles moderately right-skewed data that cannot contain zeros. Use when large values are possible but rare (e.g., large claims).

Inverse Gaussian

  • Used for extremely right-skewed data.
  • Also cannot handle zeros in the data.

Tweedie

  • A mix between Poisson and Gamma, where the variance power is between 1 and 2.
  • Can handle zeros in the data. Used when zeros are mixed with positive continuous data.

Framework for selecting a distribution

What is the data type of the target variable?

  • Binary: Logistic
  • Probabilistic: Binomial
  • Count: Poisson or Negative Binomial
  • Continuous: Normal (any value) or Gamma/Inverse Gaussian (positive values only)

What is the shape of the data?

If the target variable is a count, check dispersion:

  • Equidispersion (variance = mean): Use Poisson.
  • Overdispersion (variance > mean): Use negative binomial.

If target variable is positive continuous:

  • Look at the shape of the data
    • Symmetric: Normal distribution
    • Right-skewed: Gamma/Inverse Gaussian
  • Look at outliers
    • No outliers, relatively bounded: Normal
    • Outliers with a heavy tail: Gamma

Framework for Distribution Selection

Quick Decision Flowchart


Step 1: What is the data type of the target variable?

Binary (0/1, Yes/No)

  • Distribution: Binomial
  • Common link: Logit
  • Interpretation: Coefficients represent log-odds; exp(β) = odds ratio
  • Example: Lapsed (yes/no), Claimed (yes/no), Died (yes/no)

Proportion/Probability (between 0 and 1)

  • Distribution: Binomial (with number of trials)
  • Common link: Logit
  • Example: Lapse rate, proportion of portfolio claiming

Count (0, 1, 2, 3, …)

  • Go to Step 2a: Check dispersion
  • Base distributions: Poisson, Negative Binomial

Continuous - Any Value (-∞ to +∞)

  • Distribution: Normal (Gaussian)
  • Common link: Identity
  • Interpretation: Coefficients are additive; β = direct unit change
  • Example: Investment returns, temperature, residuals

Continuous - Positive Only (>0)

  • Go to Step 2b: Check for zeros and shape
  • Base distributions: Gamma, Inverse Gaussian, Tweedie, Normal

Step 2a: For Count Data - Check Dispersion

Calculate the variance-to-mean ratio from your data.
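This check is simple to carry out. A Python sketch on hypothetical claim counts (the 0.8–1.2 band below mirrors the rule of thumb in this section):

```python
import statistics

# Hypothetical claim counts per policy
counts = [0, 0, 0, 1, 0, 2, 0, 0, 5, 0, 1, 3]

mean = statistics.mean(counts)
var = statistics.variance(counts)   # sample variance
ratio = var / mean                  # dispersion ratio

if 0.8 <= ratio <= 1.2:
    choice = "Poisson"
else:
    choice = "Negative Binomial"
print(round(ratio, 2), choice)
```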

Ratio ≈ 1 (between 0.8 and 1.2)

  • Distribution: Poisson
  • Interpretation: Equidispersion (variance = mean)
  • Link: Log (multiplicative effects)
  • Coefficient interpretation: exp(β) = multiplicative effect
    • β = 0.05 → 5% increase per unit

Ratio > 1.2

  • Distribution: Negative Binomial
  • Interpretation: Overdispersion (variance > mean) - most common in insurance
  • Link: Log (multiplicative effects)
  • Why this happens: Unobserved heterogeneity (some policyholders are inherently riskier)
  • Example: Auto claim frequency - most drivers have 0 claims, some have many

Excess Zeros (>60-70% zeros AND poor model fit)

  • Consider: Zero-Inflated Poisson (ZIP) or Zero-Inflated Negative Binomial (ZINB)
  • Why: Two processes - structural zeros (will never claim) + count process
  • Choose ZIP vs ZINB: Same logic as Poisson vs NB based on dispersion

Step 2b: For Continuous Positive Data - Check Zeros and Shape

Data Contains Zeros (e.g., many policies with $0 claims)

Distribution: Tweedie (compound Poisson-Gamma, with 1 < p < 2)

  • Handles point mass at zero + continuous positive values
  • Link: Log (multiplicative effects)
  • Single model for total claim cost
  • Alternative: Two-part model (Binomial for zero/non-zero, then Gamma for amount given non-zero)
  • Tradeoff: Tweedie is simpler but less interpretable than two-part

Data is Strictly Positive (No Zeros)

Check the shape of the distribution:

Symmetric around the mean

  • Distribution: Normal
  • Even though positive-only, if symmetric and bounded, Normal works
  • Check: Mean ≈ Median, bell-shaped
  • Link: Identity (additive effects)
  • Example: Test scores, performance ratings, standardized measurements

Right-Skewed (long tail to the right)

  • Distribution: Gamma (default) or Inverse Gaussian
  • Check: Mean > Median, tail extends far right
  • Link: Log (multiplicative effects)
  • Coefficient interpretation: exp(β) = multiplicative effect
  • Example: Claim amounts, house prices, income

Choosing between Gamma and Inverse Gaussian:

  • Default: Gamma (more common, flexible, well-understood)
  • Theoretical duration/time context: Inverse Gaussian
  • For the PA exam: Both are acceptable GLM-compatible options

Summary Table

Target Variable Type               Distribution       Link      Coefficient Interpretation
Binary (Yes/No)                    Binomial           Logit     exp(β) = Odds Ratio
Proportion (0-1)                   Binomial           Logit     exp(β) = Odds Ratio
Count (variance ≈ mean)            Poisson            Log       exp(β) = Multiplicative
Count (variance > mean)            Negative Binomial  Log       exp(β) = Multiplicative
Count (excess zeros)               ZIP/ZINB           Log       exp(β) = Multiplicative
Continuous (any value)             Normal             Identity  β = Additive
Continuous (positive, symmetric)   Normal             Identity  β = Additive
Continuous (positive, skewed)      Gamma              Log       exp(β) = Multiplicative
Continuous (positive, with zeros)  Tweedie            Log       exp(β) = Multiplicative

Random Notes

Deviance is a measure of goodness of fit of a GLM (analogous to the residual sum of squares in linear regression). The null deviance is the deviance when the target is predicted using only the sample mean (analogous to the total sum of squares).

Residual Plots

Residual vs fitted plots check the homogeneity of the variance and the linearity of the relationship.

Interpreting Residual results

Offsets and Weights

Offsets

Offsets are model terms whose coefficients are known in advance (fixed at 1) and so do not need to be estimated. Offsets handle known differences in exposure or scale across observations. When the response variable represents counts or totals accumulated over different time periods, geographical areas, or population sizes, observations can’t be compared directly without accounting for these differences, hence the use of offsets.

Example: If modelling claim counts and one policyholder had coverage for 0.5 years while another had coverage for 1.5 years, comparing their raw claim counts would be misleading. The offset adjusts for this.
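The exposure adjustment in that example can be sketched as follows (Python, hypothetical numbers; the commented formula shows where log(exposure) enters a log-link Poisson GLM):

```python
import math

# Two hypothetical policyholders with different coverage periods
claims = [1, 3]
exposure = [0.5, 1.5]   # years of coverage

# Raw counts are misleading; annualized rates are comparable
rates = [c / e for c, e in zip(claims, exposure)]

# In a log-link Poisson GLM the adjustment enters the linear predictor
# as log(exposure) with its coefficient fixed at 1:
#   log(E[claims]) = log(exposure) + X @ beta
offsets = [math.log(e) for e in exposure]
print(rates)
```

Here both policyholders have the same annualized rate even though their raw counts differ threefold.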

Weights

Prior weights are used when data is aggregated into a single record. They tell the model to treat that record as representing x observations.

Effect on Model Fitting

Prior weights affect the deviance and degrees of freedom:

  • Weighted deviance: Each observation’s contribution is multiplied by its weight.
  • Degrees of Freedom: n = sum of weights, not number of rows.
  • Standard Errors: Smaller when weights are larger (more data behind each estimate).

Disaggregated data (3 rows)

Age  Claims  Exposure
 25       1       1.0
 25       0       1.0
 25       2       1.0

Aggregated data (1 row)

Age  Claims  Exposure  Weight
 25       3       3.0       3
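The equivalence between the two layouts can be checked directly (Python, using the numbers from the two tables above):

```python
# Disaggregated: three age-25 records
claims = [1, 0, 2]
exposure = [1.0, 1.0, 1.0]

# Aggregated: one row carrying a weight of 3
agg_claims, agg_exposure, weight = 3, 3.0, 3

# Both layouts imply the same observed claim rate, which is why a
# weighted fit on the aggregated row reproduces the disaggregated fit
rate_disagg = sum(claims) / sum(exposure)
rate_agg = agg_claims / agg_exposure
print(rate_disagg, rate_agg)
```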

Difference between Offsets and Weights

The difference between these is their roles in the model structure.

  • Offset goes into the linear predictor; weight modifies the likelihood contribution.
  • Offset is part of the model specification; weight is part of the data structure.
  • Offset represents exposure/opportunity for an event; weight represents replicated events.
  • Offset is a continuous measure of scale. Weight is a discrete count of replications.
  • Offset answers “What is the scale/exposure for this observation?”; weight answers “How many observations does this row represent?”

Notes on Regularization

Observations should be standardized before regularization.

Hyperparameters:

  • Lambda (\(\lambda\)): the penalty parameter.
  • Alpha (\(\alpha\)): the mix between ridge and lasso regression; used in elastic net.
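The two hyperparameters combine in the elastic net objective. In a common (glmnet-style) parameterization for linear regression:

\(\min_{\beta} \; \frac{1}{2n}\sum_{i=1}^{n}(y_i - x_i^{\top}\beta)^2 + \lambda\left[\frac{1-\alpha}{2}\sum_j \beta_j^2 + \alpha\sum_j |\beta_j|\right]\)

Setting \(\alpha = 0\) gives ridge regression, \(\alpha = 1\) gives the lasso, and values in between give elastic net.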

Decision Trees

Below is an example of a basic decision tree. In each node:

  • The top label is the predicted class.
  • The middle number is the predicted probability of the majority class (in this case M).
  • The bottom number is the percentage of observations in the node.

Impurity Measures for Classification Trees

Impurity measures how “mixed” the classes are within a node. A pure node contains observations from only one class. The goal is to reduce impurity at each split.

Three main impurity metrics are used:

Gini Index

\(Gini(N) = 1 - \sum_{i=1}^c p_i^2\)

Entropy

\(Entropy(N) = - \sum_{i=1}^c p_i \log_2 (p_i)\)

Information Gain

The reduction in impurity (typically entropy) achieved by a split: the parent node’s impurity minus the weighted average impurity of the child nodes. Splits are chosen to maximize this gain.

Classification Error Rate

\(\text{Classification Error}(N) = 1 - \max_{i=1..c} p_i\)
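All three measures are straightforward to compute from a node's class proportions. A Python sketch on a hypothetical node (75% class M, 25% class B):

```python
import math

def gini(p):
    """Gini index: 1 minus the sum of squared class proportions."""
    return 1 - sum(pi ** 2 for pi in p)

def entropy(p):
    """Entropy in bits; skip zero proportions (0 * log 0 = 0)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def class_error(p):
    """Classification error: 1 minus the majority-class proportion."""
    return 1 - max(p)

# Hypothetical node: 75% class M, 25% class B
p = [0.75, 0.25]
print(round(gini(p), 4), round(entropy(p), 4), round(class_error(p), 4))
# A pure node scores 0 under all three measures
print(gini([1.0]), entropy([1.0]), class_error([1.0]))
```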

Decision Tree Control Parameters

Minsplit

This controls the minimum number of observations that must exist in a node for a split of that node to be attempted. The evaluation is done prior to the split.

Minbucket

This controls the minimum number of observations that must exist in the new leaf node after a split is done. The evaluation is done after the split and so if the evaluation is not passed, then the split is reversed.

Complexity Parameter (Cp)

This controls the minimum impurity reduction required for a split to be made. If Cp is .01 and a split’s error reduction is less than .01, the split won’t be made.

Below is an example of a Cp table. We look for the row with the lowest xerror (the cross-validation error) in the table below.

Complexity Parameter Table

    CP  nsplit  rel error  xerror    xstd
0.7919       0     1.0000  1.0000  0.0648
0.0604       1     0.2081  0.3087  0.0428
0.0268       2     0.1477  0.2550  0.0394
0.0201       4     0.0940  0.2685  0.0403
0.0134       6     0.0537  0.2282  0.0374
0.0067       7     0.0403  0.2215  0.0369
0.0000       8     0.0336  0.2081  0.0359

Max depth

This controls how many levels of nodes are allowed in a tree. The root node is counted as depth 0, with its child nodes counted as depth 1, and so on.

Cost Complexity Pruning

After determining the optimal Cp, we can grow a new tree from scratch using that Cp value. One downside to this method is that a good split may come after a bad split, but the model will never reach it because the bad split did not pass the Cp requirement.

One way to avoid this is by building a fully complex tree then pruning backwards. This is called cost complexity pruning.

Cost complexity pruning begins with a full tree (grown with a Cp of 0), then removes the least important splits according to the optimal Cp value.

Regression Trees

Regression trees are grown similarly to classification trees. Instead of an impurity measure like Gini, RSS is used.

Variable Importance

Variable importance shows the ordering of variables according to their contribution to the model.

Variable Importance Table

Variable              Importance
concave.points_mean     134.5302
perimeter_worst         113.6826
radius_worst            113.3666
concave.points_worst    112.9353
concavity_mean          109.4282
concavity_worst          90.3203
area_worst               25.7681
texture_worst            17.4773
area_mean                14.3656
perimeter_mean           14.3656

Variable Importance Plot

Model Assessment in Classification

Confusion Matrix

For classification, the simplest form of model assessment is using a confusion matrix.

Example (template)

                    Actual Positive  Actual Negative
Predicted Positive  TP               FP
Predicted Negative  FN               TN

Example (rows = predicted, columns = actual)

      High  Low
High   128    9
Low    346  466

From the confusion matrix we can derive several performance measures: accuracy, precision, sensitivity/recall, and specificity.

Accuracy

The proportion of all predictions (both positive and negative) that were correct.

\(accuracy = \frac{TP + TN}{N}\)

Precision

The proportion of positive predictions that were actually positive. When positive is predicted, how often is it right?

\(precision = \frac{TP}{TP+FP}\)

Sensitivity

The proportion of actual positives that were classed correctly. Out of all actual positives, how many were caught?

\(sensitivity = \frac{TP}{TP+FN}\)

Specificity

The proportion of actual negatives that were classed correctly.

\(specificity = \frac{TN}{TN+FP}\)
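Applying these formulas to the High/Low matrix above (reading rows as predicted and columns as actual, so TP = 128, FP = 9, FN = 346, TN = 466):

```python
# Counts from the High/Low confusion matrix above
TP, FP = 128, 9      # predicted High
FN, TN = 346, 466    # predicted Low
N = TP + FP + FN + TN

accuracy = (TP + TN) / N            # all correct predictions
precision = TP / (TP + FP)          # correct among positive predictions
sensitivity = TP / (TP + FN)        # actual positives caught
specificity = TN / (TN + FP)        # actual negatives caught

print(round(accuracy, 3), round(precision, 3),
      round(sensitivity, 3), round(specificity, 3))
```

Note how precision is high but sensitivity is low here: most High predictions are right, yet most actual Highs are missed.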

ROC Curve (Receiver Operator Characteristic Curve)

Another form of model assessment for classification is the ROC curve. This curve plots the true positive rate (sensitivity) against the false positive rate (1 - specificity) across cut-off values.

A cut-off value is used as a threshold: an observation is classified as positive only if its predicted probability meets or exceeds the cut-off. For example, with a cut-off of .8, a node with probability .75 for Yes is still classed as No, even though Yes is the majority class.

Ensemble Trees

Random Forest

An ensemble of many decision trees built on bootstrapped samples and random subsets of predictors. This reduces variance by averaging many roughly independent trees, while keeping bias similar to that of a single deep tree.

Random Forest Algorithm

1. Training
   a. Take a random sample of observations (with replacement) from the training data.
   b. For each split, choose among a random sample of features (without replacement) to determine that split.
   c. Train a decision tree on the above.
   d. Repeat.

2. Predicting
   a. Predict the target using each previously trained tree.
   b. Average the predictions.
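The training/prediction loop above can be sketched in miniature. This is a pure-Python toy where one-split "stumps" stand in for full decision trees, and the data and every name are hypothetical:

```python
import random
import statistics

def fit_stump(sample, feature):
    """One-split regression 'tree': split the chosen feature at its
    median and predict the mean target on each side."""
    xs = [row[feature] for row, _ in sample]
    cut = statistics.median(xs)
    left = [y for row, y in sample if row[feature] <= cut] or [0.0]
    right = [y for row, y in sample if row[feature] > cut] or [0.0]
    return feature, cut, statistics.mean(left), statistics.mean(right)

def predict_stump(stump, row):
    feature, cut, left, right = stump
    return left if row[feature] <= cut else right

def random_forest(data, n_trees=50, seed=1):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # a. bootstrap: sample observations with replacement
        sample = [rng.choice(data) for _ in data]
        # b. feature subsetting: each stump sees one random feature
        feature = rng.randrange(len(data[0][0]))
        forest.append(fit_stump(sample, feature))   # c. train
    return forest

def predict(forest, row):
    # 2. average the predictions of all trees
    return statistics.mean(predict_stump(s, row) for s in forest)

# Toy data: target = x0 + x1 (hypothetical)
data = [((x0, x1), x0 + x1) for x0 in range(5) for x1 in range(5)]
forest = random_forest(data)
print(round(predict(forest, (4, 4)), 2), round(predict(forest, (0, 0)), 2))
```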
Random Forest Parameters

Number of Trees

The number of trees to grow. The more trees the better, especially for datasets with a large number of observations or predictors.

Proportion of Observations

The proportion of observations randomly sampled to build each individual tree. Each observation should be used at least once, so if the proportion of observations is small, the number of trees needs to be larger.

Proportion of features

The proportion of features to be used at each split. This parameter is usually tuned as part of the model fitting process.

Gradient Boosting Machine

Builds trees sequentially, where each new tree attempts to correct the errors of the previous ones. This reduces bias, but can increase variance, making tuning/regularization important.
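The sequential error-correcting idea can be sketched with one-split stumps on a single feature. A pure-Python toy (hypothetical data; `lr` is the shrinkage/learning-rate hyperparameter that regularizes each correction):

```python
import statistics

def fit_stump(xs, ys):
    """One-split regression tree on a single feature."""
    cut = statistics.median(xs)
    left = [y for x, y in zip(xs, ys) if x <= cut]
    right = [y for x, y in zip(xs, ys) if x > cut] or [0.0]
    return cut, statistics.mean(left), statistics.mean(right)

def gbm_fit(xs, ys, n_trees=100, lr=0.1):
    """Each stump is fit to the residuals left by the ensemble so far."""
    base = statistics.mean(ys)                 # initial prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        cut, left, right = fit_stump(xs, residuals)
        stumps.append((cut, left, right))
        # shrink each correction by the learning rate
        preds = [p + lr * (left if x <= cut else right)
                 for x, p in zip(xs, preds)]
    return base, stumps

def gbm_predict(model, x, lr=0.1):
    base, stumps = model
    return base + sum(lr * (l if x <= c else r) for c, l, r in stumps)

xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 0, 0, 10, 10, 10, 10]   # a step function (hypothetical)
model = gbm_fit(xs, ys)
print(round(gbm_predict(model, 1), 2), round(gbm_predict(model, 6), 2))
```

After many small corrections, the ensemble recovers the step function almost exactly; with too few trees or too large a learning rate, the fit degrades, which is why tuning matters.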